Abstract

I argue that data becomes temporarily interesting by itself to some self-impro- 
ving, but computationally limited, subjective observer once he learns to predict or 
compress the data in a better way, thus making it subjectively simpler and more 
beautiful. Curiosity is the desire to create or discover more non-random, non- 
arbitrary, regular data that is novel and surprising not in the traditional sense of 
Boltzmann and Shannon but in the sense that it allows for compression progress
because its regularity was not yet known. This drive maximizes interestingness, the 
first derivative of subjective beauty or compressibility, that is, the steepness of the 
learning curve. It motivates exploring infants, pure mathematicians, composers, 
artists, dancers, comedians, yourself, and (since 1990) artificial systems. 

First version of this preprint published 23 Dec 2008; revised 15 April 2009. Short 
version: [91]. Long version: [90]. We distill some of the essential ideas in earlier
work (1990-2008) on this subject: /l53 |5l] |67] |59] [6^ UM Wl^^ and especially 
recent papers M\WM^MS- 
1 Store & Compress & Reward Compression Progress

If the history of the entire universe were computable [123, 124], and there is no
evidence against this possibility [84], then its simplest explanation would be the shortest
program that computes it [65, 70]. Unfortunately there is no general way of finding the
shortest program computing any given data [34, 106, 107, 37]. Therefore physicists
have traditionally proceeded incrementally, analyzing just a small aspect of the world 
at any given time, trying to find simple laws that allow for describing their limited 
observations better than the best previously known law, essentially trying to find a pro- 
gram that compresses the observed data better than the best previously known program. 
For example, Newton's law of gravity can be formulated as a short piece of code which 
allows for substantially compressing many observation sequences involving falling ap- 
ples and other objects. Although its predictive power is limited — for example, it does 
not explain quantum fluctuations of apple atoms — it still allows for greatly reducing the 
number of bits required to encode the data stream, by assigning short codes to events 
that are predictable with high probability [28] under the assumption that the law holds.
Einstein's general relativity theory yields additional compression progress as it com- 
pactly explains many previously unexplained deviations from Newton's predictions. 
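
To make the compression view of Newton's law concrete, the following minimal sketch compares two ways of storing noisy height measurements of a falling object: a fixed-length code for every raw reading, and the law plus short codes for the small residuals. The sensor range, the 1 mm resolution, the noise level, and the 128 bits charged for the law and its parameters are illustrative assumptions, not values from the text.

```python
import math
import random

G = 9.81            # m/s^2, gravitational acceleration near Earth's surface
RESOLUTION = 0.001  # readings are quantized to 1 mm (assumption)
MAX_HEIGHT = 100.0  # sensor range in metres (assumption)
LAW_BITS = 128      # assumed cost of writing down the law and its parameters


def simulate_fall(h0, n, dt, noise_sd, rng):
    """Noisy height readings of an object dropped from height h0."""
    return [h0 - 0.5 * G * (i * dt) ** 2 + rng.gauss(0.0, noise_sd)
            for i in range(n)]


def raw_bits(n):
    """Fixed-length code: every quantized reading costs the same."""
    return n * math.ceil(math.log2(MAX_HEIGHT / RESOLUTION))


def bits_given_law(heights, h0, dt):
    """Law plus a fixed-length code for each quantized residual."""
    residuals = [round((h - (h0 - 0.5 * G * (i * dt) ** 2)) / RESOLUTION)
                 for i, h in enumerate(heights)]
    span = 2 * max(abs(r) for r in residuals) + 1  # distinct residual values
    return LAW_BITS + len(residuals) * math.ceil(math.log2(span))


rng = random.Random(0)
heights = simulate_fall(h0=50.0, n=200, dt=0.01, noise_sd=0.005, rng=rng)
print("raw encoding:", raw_bits(len(heights)), "bits")
print("law + residuals:", bits_given_law(heights, 50.0, 0.01), "bits")
```

Under these assumptions each raw reading costs 17 bits, while a residual needs only enough bits to cover the measurement noise; the size of the saving depends on the noise level.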

Most physicists believe there is still room for further advances. Physicists, however, 
are not the only ones with a desire to improve the subjective compressibility of their 
observations. Since short and simple explanations of the past usually reflect some 
repetitive regularity that helps to predict the future as well, every intelligent system 
interested in achieving future goals should be motivated to compress the history of raw 
sensory inputs in response to its actions, simply to improve its ability to plan ahead. 

A long time ago, Piaget [49] already explained the explorative learning behavior
of children through his concepts of assimilation (new inputs are embedded in old
schemas — this may be viewed as a type of compression) and accommodation (adapting 
an old schema to a new input — this may be viewed as a type of compression improve- 
ment), but his informal ideas did not provide enough formal details to permit computer 
implementations of his concepts. How can a compression progress drive be modeled in
artificial systems? Consider an active agent interacting with an initially unknown world.
We may use our general Reinforcement Learning (RL) framework of artificial curiosity 
(1990-2008) to make the agent discover
data that allows for additional compression progress and improved predictability.
The framework directs the agent towards a better understanding of the world through
active exploration, even when external reward is rare or absent, through intrinsic reward
or curiosity reward for actions leading to discoveries of previously unknown regulari- 
ties in the action-dependent incoming data stream. 

1.1 Outline 

Section 1.2 will informally describe our algorithmic framework based on: (1) a continually
improving predictor or compressor of the continually growing data history, (2) a
computable measure of the compressor's progress (to calculate intrinsic rewards), (3) a
reward optimizer or reinforcement learner translating rewards into action sequences
expected to maximize future reward. The formal details are left to the Appendix, which
will elaborate on the underlying theoretical concepts and describe discrete time
implementations. Section 1.3 will discuss the relation to external reward (external in
the sense of: originating outside of the brain which is controlling the actions of its
"external" body). Section 2 will informally show that many essential ingredients of
intelligence and cognition can be viewed as natural consequences of our framework,
for example, detection of novelty & surprise & interestingness, unsupervised shifts of
attention, subjective perception of beauty, curiosity, creativity, art, science, music, and
jokes. In particular, we reject the traditional Boltzmann / Shannon notion of surprise,
and demonstrate that both science and art can be regarded as by-products of the desire
to create / discover more data that is compressible in hitherto unknown ways. Section
3 will give an overview of previous concrete implementations of approximations of
our framework. Section 4 will apply the theory to images tailored to human observers,
illustrating the rewarding learning process leading from less to more subjective
compressibility. Section 5 will outline how to improve our previous implementations, and
how to further test predictions of our theory in psychology and neuroscience.

1.2 Algorithmic Framework 

The basic ideas are embodied by the following set of simple algorithmic principles 
distilling some of the essential ideas in previous publications on this topic ifSTllSSllMl 
|5£M JM M]|72l|76l|8i i8,ill9J. As mentioned above, formal details are left to 
the Appendix. As discussed in Section 2, the principles at least qualitatively explain
many aspects of intelligent agents such as humans. This encourages us to implement 
and evaluate them in cognitive robots and other artificial systems. 

1. Store everything. During interaction with the world, store the entire raw history
of actions and sensory observations including reward signals: the data is holy as
it is the only basis of all that can be known about the world. To see that full data
storage is not unrealistic: A human lifetime rarely lasts much longer than 3 x 10^9
seconds. The human brain has roughly 10^10 neurons, each with 10^4 synapses on
average. Assuming that only half of the brain's capacity is used for storing raw
data, and that each synapse can store at most 6 bits, there is still enough capacity
to encode the lifelong sensory input stream with a rate of roughly 10^5 bits/s,
comparable to the demands of a movie with reasonable resolution. The storage
capacity of affordable technical systems will soon exceed this value. If you can
store the data, do not throw it away!

2. Improve subjective compressibility. In principle, any regularity in the data 
history can be used to compress it. The compressed version of the data can be 
viewed as its simplifying explanation. Thus, to better explain the world, spend 
some of the computation time on an adaptive compression algorithm trying to 
partially compress the data. For example, an adaptive neural network [8] may
be able to learn to predict or postdict some of the historic data from other historic
data, thus incrementally reducing the number of bits required to encode the
whole. See Appendix A.3 and A.5.

3. Let intrinsic curiosity reward reflect compression progress. The agent should 
monitor the improvements of the adaptive data compressor: whenever it learns to 
reduce the number of bits required to encode the historic data, generate an intrin- 
sic reward signal or curiosity reward signal in proportion to the learning progress 
or compression progress, that is, the number of saved bits. See Appendix IA.5I 
andlA^l 

4. Maximize intrinsic curiosity reward [57, 58, 61, 59, 60, 108, 68, 72, 76, 81, 88,
87]. Let the action selector or controller use a general Reinforcement Learning
(RL) algorithm (which should be able to observe the current state of the adaptive 
compressor) to maximize expected reward, including intrinsic curiosity reward. 
To optimize the latter, a good RL algorithm will select actions that focus the 
agent's attention and learning capabilities on those aspects of the world that allow 
for finding or creating new, previously unknown but learnable regularities. In 
other words, it will try to maximize the steepness of the compressor's learning 
curve. This type of active unsupervised learning can help to figure out how the 
world works. See Appendix lA^ [Ol \A3\ lAlOl 
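
The storage estimate in principle 1 can be checked with simple arithmetic. The sketch below assumes a lifetime of 3 x 10^9 s, 10^10 neurons, and 10^4 synapses per neuron, together with the stated 6 bits per synapse and the half of the capacity reserved for raw data.

```latex
\begin{align*}
\text{capacity} &\approx \tfrac{1}{2} \times 10^{10}\ \text{neurons}
   \times 10^{4}\ \tfrac{\text{synapses}}{\text{neuron}}
   \times 6\ \tfrac{\text{bits}}{\text{synapse}}
   = 3 \times 10^{14}\ \text{bits},\\
\text{rate} &\approx \frac{3 \times 10^{14}\ \text{bits}}{3 \times 10^{9}\ \text{s}}
   = 10^{5}\ \text{bits/s}.
\end{align*}
```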

The framework above essentially specifies the objectives of a curious or creative 
system, not the way of achieving the objectives through the choice of a particular 
adaptive compressor or predictor and a particular RL algorithm. Some of the possi- 
ble choices leading to special instances of the framework (including previous concrete 
implementations) will be discussed later. 
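
The four principles can be sketched as a toy loop. This is a minimal illustration under strong assumptions, not one of the implementations discussed later: the compressor is a hypothetical order-1 frequency model with Laplace smoothing, the "world" offers three made-up actions (a dark room, white noise, a learnable cycle), and the RL algorithm is replaced by an epsilon-greedy running average that, unlike the general case described above, does not observe the compressor's state. All names and parameters are illustrative.

```python
import math
import random
from collections import defaultdict

ALPHABET = 4  # number of distinct sensory symbols (assumption)


class AdaptiveCompressor:
    """Order-1 frequency model: a symbol costs -log2 p(symbol | context) bits."""

    def __init__(self):
        self.counts = defaultdict(lambda: [0] * ALPHABET)

    def bits(self, history):
        """Bits needed to encode all (context, symbol) pairs with the current model."""
        total = 0.0
        for context, symbol in history:
            c = self.counts[context]
            total -= math.log2((c[symbol] + 1) / (sum(c) + ALPHABET))  # Laplace smoothing
        return total

    def learn(self, context, symbol):
        self.counts[context][symbol] += 1


def world(action, previous, rng):
    """Toy environment: 0 = dark room, 1 = white noise, 2 = learnable cycle."""
    if action == 0:
        return 0
    if action == 1:
        return rng.randrange(ALPHABET)
    return (previous + 1) % ALPHABET


def run(steps=300, epsilon=0.1, rate=0.1, seed=0):
    rng = random.Random(seed)
    compressor, history = AdaptiveCompressor(), []
    value = [0.0, 0.0, 0.0]   # running estimate of curiosity reward per action
    previous = [0, 0, 0]      # last symbol observed after each action
    visits = [0, 0, 0]
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(3)                         # explore
        else:
            action = max(range(3), key=lambda a: value[a])    # exploit
        symbol = world(action, previous[action], rng)
        pair = ((action, previous[action]), symbol)
        history.append(pair)                                  # 1. store everything
        before = compressor.bits(history)
        compressor.learn(*pair)                               # 2. improve the compressor
        reward = before - compressor.bits(history)            # 3. reward = saved bits
        value[action] += rate * (reward - value[action])      # 4. maximize the reward
        previous[action] = symbol
        visits[action] += 1
    return visits, value


visits, value = run()
print("visits per action (dark, noise, cycle):", visits)
```

Recomputing the code length of the whole history at every step keeps the sketch short but costs time quadratic in the number of steps; a practical system would need a cheaper progress estimate.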

1.3 Relation to External Reward 

Of course, the real goal of many cognitive systems is not just to satisfy their curiosity, 
but to solve externally given problems. Any formalizable problem can be phrased as an 
RL problem for an agent living in a possibly unknown environment, trying to maximize 
the future external reward expected until the end of its possibly finite lifetime. The new 
millennium brought a few extremely general, even universal RL algorithms (universal 
problem solvers or universal artificial intelligences; see Appendix A.8, A.9) that are
optimal in various theoretical but not necessarily practical senses, e.g., [29, 79, 82,
83, 86, 85, 92]. To the extent that learning progress / compression progress / curiosity
as above are helpful, these universal methods will automatically discover and exploit 
such concepts. Then why bother at all writing down an explicit framework for active 
curiosity-based experimentation? 

One answer is that the present universal approaches sweep under the carpet certain 
problem-independent constant slowdowns, by burying them in the asymptotic notation 
of theoretical computer science. They leave open an essential remaining question: 
If the agent can execute only a fixed number of computational instructions per unit 
time interval (say, 10 trillion elementary operations per second), what is the best way 
of using them to get as close as possible to the recent theoretical limits of universal 
AIs, especially when external rewards are very rare, as is the case in many realistic 
environments? The premise of this paper is that the curiosity drive is such a general 
and generally useful concept for limited-resource RL in rare-reward environments that 
it should be prewired, as opposed to being learnt from scratch, to save on (constant but
possibly still huge) computation time. An inherent assumption of this approach is that 
in realistic worlds a better explanation of the past can only help to better predict the 
future, and to accelerate the search for solutions to externally given tasks, ignoring the 
possibility that curiosity may actually be harmful and "kill the cat." 

2 Consequences of the Compression Progress Drive 

Let us discuss how many essential ingredients of intelligence and cognition can be 
viewed as natural by-products of the principles above. 

2.1 Compact Internal Representations or Symbols as By-Products 
of Efficient History Compression 

To compress the history of observations so far, the compressor (say, a predictive neural 
network) will automatically create internal representations or symbols (for example, 
patterns across certain neural feature detectors) for things that frequently repeat them- 
selves. Even when there is limited predictability, efficient compression can still be 
achieved by assigning short codes to events that are predictable with high probability 
[28, 95]. For example, the sun goes up every day. Hence it is efficient to create internal
symbols such as daylight to describe this repetitive aspect of the data history by a short 
reusable piece of internal code, instead of storing just the raw data. In fact, predictive 
neural networks are often observed to create such internal (and hierarchical) codes as a 
by-product of minimizing their prediction error on the training data. 

2.2 Consciousness as a Particular By-Product of Compression 

There is one thing that is involved in all actions and sensory inputs of the agent, namely, 
the agent itself. To efficiently encode the entire data history, it will profit from creating 
some sort of internal symbol or code (e.g., a neural activity pattern) representing
the agent itself. Whenever this representation is actively used, say, by activating the
corresponding neurons through new incoming sensory inputs or otherwise, the agent
could be called self-aware or conscious. 

This straightforward explanation apparently does not abandon any essential aspects
of our intuitive concept of consciousness, yet seems substantially simpler than
other recent views ini2l ll05T 1 1011 125111211 . In the rest of this paper we will not have to 
attach any particular mystic value to the notion of consciousness: in our view, it is just
a natural by-product of the agent's ongoing process of problem solving and world
modeling through data compression, and will not play a prominent role in the remainder of
this paper.

2.3 The Lazy Brain's Subjective, Time-Dependent Sense of Beauty 

Let O(t) denote the state of some subjective observer O at time t. According to our lazy
brain theory [62, 66, 69, 81, 82, 88], we may identify the subjective beauty B(D, O(t))
of a new observation D (but not its interestingness; see Section 2.4) as being inversely
related to the number of bits required to encode D, given the observer's limited previous
knowledge embodied by the current state of its adaptive compressor. For example, to
efficiently encode previously viewed human faces, a compressor such as a neural net- 
work may find it useful to generate the internal representation of a prototype face. To 
encode a new face, it must only encode the deviations from the prototype [67]. Thus
a new face that does not deviate much from the prototype [17, 48] will be subjectively
more beautiful than others. Similarly for faces that exhibit geometric regularities such
as symmetries or simple proportions [69, 88]; in principle, the compressor may
exploit any regularity for reducing the number of bits required to store the data.

Generally speaking, among several sub-patterns classified as comparable by a given 
observer, the subjectively most beautiful is the one with the simplest (shortest) descrip- 
tion, given the observer's current particular method for encoding and memorizing it 
[67, 69]. For example, mathematicians find beauty in a simple proof with a short
description in the formal language they are using. Others like geometrically simple,
aesthetically pleasing, low-complexity drawings of various objects [67, 69].

This immediately explains why many human observers prefer faces similar to their 
own. What they see every day in the mirror will influence their subjective prototype 
face, for simple reasons of coding efficiency. 

2.4 Subjective Interestingness as First Derivative of Subjective 
Beauty: The Steepness of the Learning Curve 

What's beautiful is not necessarily interesting. A beautiful thing is interesting only as 
long as it is new, that is, as long as the algorithmic regularity that makes it simple has 
not yet been fully assimilated by the adaptive observer who is still learning to compress 
the data better. It makes sense to define the time-dependent subjective Interestingness
I(D, O(t)) of data D relative to observer O at time t by the first derivative of
subjective beauty: as the learning agent improves its compression
algorithm, formerly apparently random data parts become subjectively more regular
and beautiful, requiring fewer and fewer bits for their encoding. As long as this process
is not over, the data remains interesting and rewarding. The Appendix and Section 3 on
previous implementations will describe details of discrete time versions of this concept.
See also [59, 60, 108, 68, 72, 76, 81, 88, 87].
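
One way to write this down, as a sketch rather than the Appendix's own definition: let C(D | O(t)) be a hypothetical symbol for the number of bits the observer's compressor needs for D at time t, and take beauty to grow as that code length shrinks, in line with the statement that the most beautiful pattern is the one with the shortest description.

```latex
\begin{align*}
B(D, O(t)) &= -\,C(D \mid O(t)),\\
I(D, O(t)) &\sim \frac{\partial B(D, O(t))}{\partial t},\\
I(D, O(t)) &\approx B(D, O(t+1)) - B(D, O(t))
            = C(D \mid O(t)) - C(D \mid O(t+1)) \quad \text{(discrete time)}.
\end{align*}
```

In the discrete form, interestingness is the number of bits saved on D by one learning step; it drops to zero once the compressor stops improving on D.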

2.5 Pristine Beauty & Interestingness vs External Rewards 

Note that our above concepts of beauty and interestingness are limited and pristine 
in the sense that they are not a priori related to pleasure derived from external
rewards (compare Section 1.3). For example, some might claim that a hot bath on a cold
day triggers "beautiful" feelings due to rewards for achieving prewired target values
of external temperature sensors (external in the sense of: outside the brain which is
controlling the actions of its external body). Or a song may be called "beautiful" for
emotional (e.g., [13]) reasons by some who associate it with memories of external
pleasure through their first kiss. Obviously this is not what we have in mind here: we are
focusing solely on rewards of the intrinsic type based on learning progress. 

2.6 True Novelty & Surprise vs Traditional Information Theory 

Consider two extreme examples of uninteresting, unsurprising, boring data: A vision- 
based agent that always stays in the dark will experience an extremely compressible, 
soon totally predictable history of unchanging visual inputs. In front of a screen full 
of white noise conveying a lot of information and "novelty" and "surprise" in the
traditional sense of Boltzmann and Shannon [102], however, it will experience highly
unpredictable and fundamentally incompressible data. In both cases the data is
boring [72, 88] as it does not allow for further compression progress. Therefore we
reject the traditional notion of surprise. Neither the arbitrary nor the fully predictable
is truly novel or surprising; only data with still unknown algorithmic regularities are.
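
The contrast between Shannon-style information and compression progress can be illustrated with a small experiment. The sketch below is an assumption-laden toy: the learner is a hypothetical order-1 frequency model, "progress" is measured as the bits saved on the whole history by also learning from its second half, and the third stream (a dark room in which a simple cycle appears halfway through) is invented here to stand in for data with a not yet known regularity.

```python
import math
import random
from collections import Counter, defaultdict

K = 4  # alphabet size (assumption)


def pairs(stream):
    """(previous symbol, symbol) pairs; the first symbol is preceded by 0."""
    return list(zip([0] + stream[:-1], stream))


def train(ps):
    """Count how often each symbol follows each context."""
    counts = defaultdict(lambda: [0] * K)
    for prev, sym in ps:
        counts[prev][sym] += 1
    return counts


def bits(ps, counts):
    """Code length of the pairs under a Laplace-smoothed order-1 model."""
    return -sum(math.log2((counts[p][s] + 1) / (sum(counts[p]) + K))
                for p, s in ps)


def progress(stream):
    """Bits saved on the whole history by also learning from its second half."""
    ps = pairs(stream)
    half = len(ps) // 2
    return bits(ps, train(ps[:half])) - bits(ps, train(ps))


def shannon_entropy(stream):
    """Empirical order-0 entropy in bits per symbol."""
    n = len(stream)
    return sum(c / n * math.log2(n / c) for c in Counter(stream).values())


rng = random.Random(0)
N = 2000
streams = {
    "dark": [0] * N,
    "noise": [rng.randrange(K) for _ in range(N)],
    "novel": [0] * (N // 2) + [(i + 1) % K for i in range(N // 2)],
}
for name, stream in streams.items():
    print(f"{name:6s} entropy {shannon_entropy(stream):.2f} bits/symbol, "
          f"progress {progress(stream):.1f} bits")
```

In this toy setting the noise stream has the highest entropy per symbol, yet only the stream with a newly appearing regularity yields substantial compression progress.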

2.7 Attention / Curiosity / Active Experimentation 

In absence of external reward, or when there is no known way to further increase 
the expected external reward, our controller essentially tries to maximize true nov- 
elty or interestingness, the first derivative of subjective beauty or compressibility, the 
steepness of the learning curve. It will do its best to select action sequences expected 
to create observations yielding maximal expected future compression progress, given 
the limitations of both the compressor and the compressor improvement algorithm. 
It will learn to focus its attention [96, 116] and its actively chosen experiments on
things that are currently still incompressible but are expected to become compressible 
/ predictable through additional learning. It will get bored by things that already are 
subjectively compressible. It will also get bored by things that are currently incom- 
pressible but will apparently remain so, given the experience so far, or where the costs 
of making them compressible exceed those of making other things compressible, etc.



2.8 Discoveries 

An unusually large compression breakthrough deserves the name discovery. For exam- 
ple, as mentioned in the introduction, the simple law of gravity can be described by a 
very short piece of code, yet it allows for greatly compressing all previous observations 
of falling apples and other objects. 

2.9 Beyond Standard Unsupervised Learning 

Traditional unsupervised learning is about finding regularities, by clustering the data, 
or encoding it through a factorial code IH l64l with statistically independent compo- 
nents, or predicting parts of it from other parts. All of this may be viewed as special 
cases of data compression. For example, where there are clusters, a data point can be 
efficiently encoded by its cluster center plus relatively few bits for the deviation from 
the center. Where there is data redundancy, a non-redundant factorial code [64] will
be more compact than the raw data. Where there is predictability, compression can be
achieved by assigning short codes to those parts of the observations that are predictable
from previous observations with high probability [28, 95]. Generally speaking we may
say that a major goal of traditional unsupervised learning is to improve the compression 
of the observed data, by discovering a program that computes and thus explains the his- 
tory (and hopefully does so quickly) but is clearly shorter than the shortest previously 
known program of this kind. 
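
The clustering case can be made concrete with a small sketch. It assumes one-dimensional integer data, two cluster centers that have already been found, and deviations confined to a known range; none of these values come from the text, and the encoding is a plain fixed-length code rather than an optimal one.

```python
import math
import random

RANGE = 1024           # raw values lie in 0..1023 (assumption)
CENTERS = [200, 800]   # cluster centers, assumed to be already learned
MAX_DEV = 8            # deviations lie in -8..7 (assumption)


def raw_bits(points):
    """Fixed-length code for every raw value."""
    return len(points) * math.ceil(math.log2(RANGE))


def encode(point):
    """Return (index of nearest center, deviation from that center)."""
    index = min(range(len(CENTERS)), key=lambda i: abs(point - CENTERS[i]))
    return index, point - CENTERS[index]


def decode(index, deviation):
    return CENTERS[index] + deviation


def cluster_bits(points):
    """Cost of storing the centers once, plus index and deviation per point."""
    header = len(CENTERS) * math.ceil(math.log2(RANGE))
    per_point = (math.ceil(math.log2(len(CENTERS)))
                 + math.ceil(math.log2(2 * MAX_DEV)))
    return header + len(points) * per_point


rng = random.Random(0)
points = [rng.choice(CENTERS) + rng.randint(-MAX_DEV, MAX_DEV - 1)
          for _ in range(500)]
print("raw:", raw_bits(points), "bits; clustered:", cluster_bits(points), "bits")
```

The encoding is lossless (decoding returns the original point), and the saving comes entirely from the regularity that the points stay close to one of the two centers.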

Traditional unsupervised learning is not enough though — it just analyzes and en- 
codes the data but does not choose it. We have to extend it along the dimension of 
active action selection, since our unsupervised learner must also choose the actions 
that influence the observed data, just like a scientist chooses his experiments, a baby its 
toys, an artist his colors, a dancer his moves, or any attentive system [96] its next
sensory input. That's precisely what is achieved by our RL-based framework for curiosity
and creativity. 

2.10 Art & Music as By-Products of the Compression Progress 
Drive 

Works of art and music may have important purposes beyond their social aspects [3],
despite those who classify art as superfluous [50]. Good observer-dependent art
deepens the observer's insights about this world or possible worlds, unveiling previ- 
ously unknown regularities in compressible data, connecting previously disconnected 
patterns in an initially surprising way that makes the combination of these patterns 
subjectively more compressible (art as an eye-opener), and eventually becomes known 
and less interesting. I postulate that the active creation and attentive perception of all 
kinds of artwork are just by-products of our principle of interestingness and curiosity 
yielding reward for compressor improvements. 
Let us elaborate on this idea in more detail, following the discussion in [81, 88].
Artificial or human observers must perceive art sequentially, and typically also actively,
e.g., through a sequence of attention-shifting eye saccades or camera movements
scanning a sculpture, or internal shifts of attention that filter and emphasize sounds made by
a pianist, while suppressing background noise. Undoubtedly many derive pleasure and
rewards from perceiving works of art, such as certain paintings, or songs. But
different subjective observers with different sensory apparatus and compressor improvement
algorithms will prefer different input sequences. Hence any objective theory of what
is good art must take the subjective observer as a parameter, to answer questions such
as: Which sequences of actions and resulting shifts of attention should he execute to
maximize his pleasure? According to our principle he should select one that maximizes
the quickly learnable compressibility that is new, relative to his current knowledge and
his (usually limited) way of incorporating / learning / compressing new data.

2.11 Music 

For example, which song should some human observer select next? Not the one he 
just heard ten times in a row. It became too predictable in the process. But also not 
the new weird one with the completely unfamiliar rhythm and tonality. It seems too 
irregular and to contain too much arbitrariness and subjective noise. He should try a song
that is unfamiliar enough to contain somewhat unexpected harmonies or melodies or 
beats etc., but familiar enough to allow for quickly recognizing the presence of a new 
learnable regularity or compressibility in the sound stream. Sure, this song will get 
boring over time, but not yet. 

The observer dependence is illustrated by the fact that Schönberg's twelve tone
music is less popular than certain pop music tunes, presumably because its algorithmic 
structure is less obvious to many human observers as it is based on more complicated 
harmonies. For example, frequency ratios of successive notes in twelve tone music 
often cannot be expressed as fractions of very small integers. Those with a prior
education about the basic concepts and objectives and constraints of twelve tone music,
however, tend to appreciate Schönberg more than those without such an education.

All of this perfectly fits our principle: The learning algorithm of the compressor 
of a given subjective observer tries to better compress his history of acoustic and other 
inputs where possible. The action selector tries to find history-influencing actions that 
help to improve the compressor's performance on the history so far. The interesting 
musical and other subsequences are those with previously unknown yet learnable types 
of regularities, because they lead to compressor improvements. The boring patterns are 
those that seem arbitrary or random, or whose structure seems too hard to understand. 

2.12 Paintings, Sculpture, Dance, Film etc. 

Similar statements hold not only for other dynamic art including film and dance (taking
into account the compressibility of controller actions), but also for painting and
sculpture, which cause dynamic pattern sequences due to attention-shifting actions [96, 116]
of the observer.
2.13 No Objective "Ideal Ratio" Between Expected and Unexpected



Some of the previous attempts at explaining aesthetic experiences in the context of 
information theory [7, 41, 6, 44] emphasized the idea of an "ideal" ratio between
expected and unexpected information conveyed by some aesthetic object (its "order" 
vs its "complexity"). Note that our alternative approach does not have to postulate 
an objective ideal ratio of this kind. Instead our dynamic measure of interestingness 
reflects the change in the number of bits required to encode an object, and explicitly 
takes into account the subjective observer's prior knowledge as well as the limitations
of its compression improvement algorithm. 

2.14 Blurred Boundary Between Active Creative Artists and Pas- 
sive Perceivers of Art 

Just as observers get intrinsic rewards for sequentially focusing attention on artwork 
that exhibits new, previously unknown regularities, the creative artists get reward for 
making it. For example, I found it extremely rewarding to discover (after hundreds of
frustrating failed attempts) the simple geometric regularities that permitted the
construction of the drawings in Figures 1 and 2. The distinction between artists and
observers is blurred though. Both execute action sequences to exhibit new types of 
compressibility. The intrinsic motivations of both are fully compatible with our simple 
principle. 

Some artists, of course, crave external reward from other observers, in the form of
praise, money, or both, in addition to the intrinsic compression improvement-based 
reward that comes from creating a truly novel work of art. Our principle, however, 
conceptually separates these two reward types. 

2.15 How Artists and Scientists are Alike 

From our perspective, scientists are very much like artists. They actively select experi- 
ments in search for simple but new laws compressing the resulting observation history. 
In particular, the creativity of painters, dancers, musicians, pure mathematicians, and
physicists can be viewed as a mere by-product of our curiosity framework based on the
compression progress drive. All of them try to create new but non-random, non-arbitrary
data with surprising, previously unknown regularities. For example, many physicists
invent experiments to create data governed by previously unknown laws that allow
further compression of the data. On the other hand, many artists combine well-known
objects in a subjectively novel way such that the observer's subjective description of the
result is shorter than the sum of the lengths of the descriptions of the parts, due to some 
previously unnoticed regularity shared by the parts. 

What is the main difference between science and art? The essence of science is to 
formally nail down the nature of compression progress achieved through the discovery 
of a new regularity. For example, the law of gravity can be described by just a few 
symbols. In the fine arts, however, compression progress achieved by observing an 
artwork combining previously disconnected things in a new way (art as an eye-opener) 
may be subconscious and not at all formally describable by the observer, who may feel
the progress in terms of intrinsic reward without being able to say exactly which of his
memories became more subjectively compressible in the process. 

The framework in the appendix is sufficiently formal to allow for implementation 
of our principle on computers. The resulting artificial observers will vary in terms of 
the computational power of their history compressors and learning algorithms. This 
will influence what is good art / science to them, and what they find interesting. 

2.16 Jokes and Other Sources of Fun 

Just like other entertainers and artists, comedians also tend to combine well-known 
concepts in a novel way such that the observer's subjective description of the result 
is shorter than the sum of the lengths of the descriptions of the parts, due to some 
previously unnoticed regularity shared by the parts. 

In many ways the laughs provoked by witty jokes are similar to those provoked by 
the acquisition of new skills by both babies and adults. Past the age of 25 I learnt
to juggle three balls. It was not a sudden process but an incremental and rewarding
one: in the beginning I managed to juggle them for maybe one second before they fell
down, then two seconds, four seconds, etc., until I was able to do it right. Watching
myself in the mirror (as recommended by juggling teachers) I noticed an idiotic grin 
across my face whenever I made progress. Later my little daughter grinned just like 
that when she was able to stand on her own feet for the first time. All of this makes 
perfect sense within our algorithmic framework: such grins presumably are triggered 
by intrinsic reward for generating a data stream with previously unknown regularities, 
such as the sensory input sequence corresponding to observing oneself juggling, which 
may be quite different from the more familiar experience of observing somebody else 
juggling, and therefore truly novel and intrinsically rewarding, until the adaptive pre- 
dictor / compressor gets used to it. 

3 Previous Concrete Implementations of Systems Driven 
by (Approximations of) Compression Progress 

As mentioned earlier, predictors and compressors are closely related. Any type of par- 
tial predictability of the incoming sensory data stream can be exploited to improve the 
compressibility of the whole. Therefore the systems described in the first publications 
on artificial curiosity [57, 58, 61] already can be viewed as examples of implementations
of a compression progress drive.

3.1 Reward for Prediction Error (1990) 

Early work [57, 58, 61] described a predictor based on a recurrent neural network
[115, 120, 55, 62, 47, 78] (in principle a rather powerful computational device, even
by today's machine learning standards), predicting sensory inputs including reward 
signals from the entire history of previous inputs and actions. The curiosity rewards 
were proportional to the predictor errors, that is, it was implicitly and optimistically 
assumed that the predictor will indeed improve whenever its error is high. 
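
The error-based reward can be illustrated with a minimal sketch. It is not the recurrent network of the early work: the predictor here is a hypothetical running-mean estimator, and the two input streams (a constant and a fair coin) are assumptions chosen only to show how the reward behaves.

```python
import random


class RunningMeanPredictor:
    """Hypothetical toy predictor: its prediction is a running mean of past inputs."""

    def __init__(self, learning_rate=0.1):
        self.estimate = 0.0
        self.learning_rate = learning_rate

    def update(self, x):
        self.estimate += self.learning_rate * (x - self.estimate)


def error_reward(predictor, x):
    """Curiosity reward equal to the squared prediction error, then one learning step."""
    reward = (x - predictor.estimate) ** 2
    predictor.update(x)
    return reward


random.seed(0)
constant_world, coin_world = RunningMeanPredictor(), RunningMeanPredictor()
constant_rewards = [error_reward(constant_world, 1.0) for _ in range(200)]
coin_rewards = [error_reward(coin_world, random.choice([-1.0, 1.0])) for _ in range(200)]

print(constant_rewards[0], constant_rewards[-1])  # reward fades as the input is learned
print(sum(coin_rewards[-50:]) / 50)               # reward stays high on pure noise
```

On the learnable stream the reward decays towards zero, while on the coin flips it never fades, although nothing more can be learned there.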

3.2 Reward for Compression Progress Through Predictor Improvements (1991)

Follow-up work [59, 60] pointed out that this approach may be inappropriate, especially
in probabilistic environments: one should not focus on the errors of the predictor,
but on its improvements. Otherwise the system will concentrate its search on those
parts of the environment where it can always get high prediction errors due to noise or 
randomness, or due to computational limitations of the predictor, which will prevent 
improvements of the subjective compressibility of the data. While the neural predictor
of the implementation described in the follow-up work was indeed computationally
less powerful than the previous one [61], there was a novelty, namely, an explicit (neural)
adaptive model of the predictor's improvements. This model essentially learned to
predict the predictor's changes. For example, although noise was unpredictable and led 
to wildly varying target signals for the predictor, in the long run these signals did not 
change the adaptive predictor parameters much, and the predictor of predictor changes 
was able to learn this. A standard RL algorithm [114, 33, 109] was fed with curiosity
reward signals proportional to the expected long-term predictor changes, and thus tried 
to maximize information gain [16, 31, 38, 51, 14] within the given limitations. In fact,
we may say that the system tried to maximize an approximation of the (discounted) 
sum of the expected first derivatives of the data's subjective predictability, thus also 
maximizing an approximation of the (discounted) sum of the expected changes of the 
data's subjective compressibility. 
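
The difference between rewarding errors and rewarding improvements can be sketched as follows. This is a simplification, not the 1991 system: instead of a learned model of predictor changes, the sketch measures the change in loss directly on a sliding window of recent data. The running-mean predictor, the window size, and the input streams are all assumptions.

```python
import random
from collections import deque


class RunningMeanPredictor:
    """Hypothetical toy predictor whose prediction is a running mean."""

    def __init__(self, learning_rate=0.1):
        self.estimate, self.learning_rate = 0.0, learning_rate

    def loss(self, window):
        return sum((x - self.estimate) ** 2 for x in window) / len(window)

    def update(self, x):
        self.estimate += self.learning_rate * (x - self.estimate)


def run(stream, window_size=50):
    """Return per-step error-based and improvement-based rewards for one stream."""
    predictor, window = RunningMeanPredictor(), deque(maxlen=window_size)
    error_rewards, progress_rewards = [], []
    for x in stream:
        window.append(x)
        error_rewards.append((x - predictor.estimate) ** 2)
        loss_before = predictor.loss(window)   # old predictor on the recent data
        predictor.update(x)
        loss_after = predictor.loss(window)    # new predictor on the same data
        progress_rewards.append(loss_before - loss_after)
    return error_rewards, progress_rewards


def mean(values):
    return sum(values) / len(values)


random.seed(0)
noise = [random.choice([-1.0, 1.0]) for _ in range(400)]
err_noise, prog_noise = run(noise)
err_const, prog_const = run([1.0] * 400)

print(mean(err_noise[-200:]), mean(prog_noise[-200:]))  # high error, negligible progress
print(sum(prog_const), prog_const[-1])                  # progress is paid once, then fades
```

On noise the error stays high forever while the measured progress averages out close to zero, so an improvement-driven controller has no lasting incentive to stay there.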

3.3 Reward for Relative Entropy between Agent's Prior and Posterior (1995)

Additional follow-up work yielded an information theory-oriented variant of the approach
in non-deterministic worlds [108] (1995). The curiosity reward was again
proportional to the predictor's surprise / information gain, this time measured as the
Kullback-Leibler distance [35] between the learning predictor's subjective probability
distributions before and after new observations, that is, the relative entropy between
its prior and posterior.

In 2005 Baldi and Itti called this approach "Bayesian surprise" and demonstrated 
experimentally that it explains certain patterns of human visual attention better than 
certain previous approaches [32].

Note that the concepts of Huffman coding [28] and relative entropy between prior
and posterior immediately translate into a measure of learning progress reflecting the 
number of saved bits — a measure of improved data compression. 
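
As a small worked example of this measure, the sketch below computes the relative entropy in bits between a posterior and a prior over two hypotheses about a coin. The hypotheses, the uniform prior, and the direction of the divergence (posterior relative to prior) are assumptions made for illustration.

```python
import math


def kl_bits(posterior, prior):
    """Relative entropy D(posterior || prior) in bits for discrete distributions."""
    return sum(q * math.log2(q / p) for q, p in zip(posterior, prior) if q > 0)


def bayes_update(prior, likelihoods):
    """Posterior over hypotheses after one observation with the given likelihoods."""
    unnormalized = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Two hypotheses about a coin: fair, or heads with probability 0.9.
prior = [0.5, 0.5]
likelihood_of_heads = [0.5, 0.9]
posterior = bayes_update(prior, likelihood_of_heads)
surprise = kl_bits(posterior, prior)

print(posterior, surprise)
```

A single observed head shifts belief only slightly towards the biased coin, so the reward is a small fraction of a bit; an observation that leaves the beliefs unchanged yields zero.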

Note also, however, that the naive probabilistic approach to data compression is
unable to discover more general types of algorithmic compressibility [106, 34, 37, 73].
For example, the decimal expansion of π looks random and incompressible but isn't:
there is a very short algorithm computing all of π, yet any finite sequence of digits
is believed to occur in π's expansion as frequently as expected if π were truly random
(a conjecture supported by extensive digit statistics but not proven), so that
no simple statistical learner is expected to outperform random guessing at predicting
the next digit from a limited time window of previous digits. More general program search
techniques (e.g., [36, 75, 15, 46]) are necessary to extract the underlying algorithmic
regularity.
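
To make the point about π concrete, here is one well-known short program that generates its decimal digits without bound (an unbounded spigot in the style of Gibbons). It is offered only as an illustration that such a short generator exists; the choice of this particular algorithm is ours.

```python
from collections import Counter
from itertools import islice


def pi_digits():
    """Yield the decimal digits of pi one by one using exact integer arithmetic."""
    q, r, t, k, n, j = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n  # the next digit is now certain
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            q, r, t, k, n, j = (q * k, (2 * q + r) * j, t * j, k + 1,
                                (q * (7 * k + 2) + r * j) // (t * j), j + 2)


first_digits = list(islice(pi_digits(), 15))
digit_counts = Counter(islice(pi_digits(), 500))

print(first_digits)
print(sorted(digit_counts.items()))  # digit frequencies look unremarkable
```

The generator is a few lines long, yet a learner that only tracks digit frequencies in short windows sees nothing it can exploit.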

3.4 Zero Sum Reward Games for Compression Progress Revealed 
by Algorithmic Experiments (1997) 

More recent work [68, 72] (1997) greatly increased the computational power of controller
and predictor by implementing them as co-evolving, symmetric, opposing modules
consisting of self-modifying probabilistic programs [97, 98] written in a universal
programming language [18, 111] allowing for loops, recursion, and hierarchical structures.
The internal storage for temporary computational results of the programs was
viewed as part of the changing environment. Each module could suggest experiments 
in the form of probabilistic algorithms to be executed, and make confident predictions 
about their effects by betting on their outcomes, where the 'betting money' essentially 
played the role of the intrinsic reward. The opposing module could reject or accept 
the bet in a zero-sum game by making a contrary prediction. In case of acceptance, 
the winner was determined by executing the algorithmic experiment and checking its 
outcome; the money was eventually transferred from the surprised loser to the confirmed
winner. Both modules tried to maximize their money using a rather general RL
algorithm designed for complex stochastic policies [97, 98] (alternative RL algorithms
could be plugged in as well). Thus both modules were motivated to discover truly novel 
algorithmic regularity / compressibility, where the subjective baseline for novelty was 
given by what the opponent already knew about the world's repetitive regularities. 
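
The settlement of a single bet can be sketched in a few lines. This is only a schematic of the zero-sum bookkeeping, not of the self-modifying programs themselves; the function name `settle_bet`, the fixed stake, the restriction to yes/no outcomes, and the example experiment are hypothetical.

```python
def settle_bet(experiment, guess_a, guess_b, money, stake=1.0):
    """Run one betting round; `money` maps module name to its current balance."""
    if guess_a == guess_b:
        return dict(money)          # nobody takes the contrary side: no bet, no reward
    outcome = experiment()          # execute the algorithmic experiment
    winner, loser = ("a", "b") if guess_a == outcome else ("b", "a")
    updated = dict(money)
    updated[winner] += stake        # intrinsic reward for the confirmed module
    updated[loser] -= stake         # equal loss for the surprised module
    return updated


def experiment():
    # Hypothetical experiment: does 2**10 leave remainder 2 when divided by 7?
    return 2 ** 10 % 7 == 2


money = {"a": 10.0, "b": 10.0}
money = settle_bet(experiment, guess_a=True, guess_b=False, money=money)
print(money)
```

Money only changes hands when the modules disagree, so a module can earn reward only by knowing something about the outcome that its opponent does not.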

The method can be viewed as system identification through co-evolution of computable
models and tests. In 2005 a similar co-evolutionary approach based on less
general models and tests was implemented by Bongard and Lipson [11].

3.5 Improving Real Reward Intake 

Our references above demonstrated experimentally that the presence of intrinsic reward 
or curiosity reward actually can speed up the collection of external reward. 

3.6 Other Implementations 

Recently several researchers also implemented variants or approximations of the curiosity
framework. Singh and Barto and coworkers focused on implementations within
the option framework of RL [5, 104], directly using prediction errors as curiosity rewards
as in Section 3.1 [57, 58, 61]; they actually were the ones who coined the expressions
intrinsic reward and intrinsically motivated RL. Additional implementations
were presented at the 2005 AAAI Spring Symposium on Developmental Robotics [9];
compare the Connection Science Special Issue [10].

4 Visual Illustrations of Subjective Beauty and its First Derivative Interestingness



As mentioned above (Section 3.3), the probabilistic variant of our theory [108] (1995)
was able to explain certain shifts of human visual attention [32] (2005). But we can also
apply our approach to the complementary problem of constructing images that contain 
quickly learnable regularities, arguing again that there is no fundamental difference 
between the motivation of creative artists and passive observers of visual art (Section
2.14). Both create action sequences yielding interesting inputs, where interestingness
is a measure of learning progress, for example, based on the relative entropy between
prior and posterior (Section 3.3), or the saved number of bits needed to encode the data
(Section 1), or something similar (Section 3).

Here we provide examples of subjective beauty tailored to human observers, and 
illustrate the learning process leading from less to more subjective beauty. Due to 
the nature of the present written medium, we have to use visual examples instead of 
acoustic or tactile ones. Our examples are intended to support the hypothesis that unsupervised
attention and the creativity of artists, dancers, musicians, and pure mathematicians
are just by-products of their compression progress drives.

4.1 A Pretty Simple Face with a Short Algorithmic Description 

Figure 1 depicts the construction plan of a female face considered 'beautiful' by some
human observers. It also shows that the essential features of this face follow a very 
simple geometrical pattern [69] that can be specified by very few bits of information. 
That is, the data stream generated by observing the image (say, through a sequence 
of eye saccades) is more compressible than it would be in the absence of such regu- 
larities. Although few people are able to immediately see how the drawing was made 
in absence of its superimposed grid-based explanation, most do notice that the facial 
features somehow fit together and exhibit some sort of regularity. According to our 
postulate, the observer's reward is generated by the conscious or subconscious discovery
of this compressibility. The face remains interesting until its observation no longer
reveals any additional previously unknown regularities. Then it becomes boring even in
the eyes of those who think it is beautiful; as has been pointed out repeatedly above,
beauty and interestingness are two different things.

4.2 Another Drawing That Can Be Encoded By Very Few Bits 

Figure 2 provides another example: a butterfly and a vase with a flower. It can be
specified by very few bits of information as it can be constructed through a very simple
procedure or algorithm based on fractal circle patterns [67]; see Figure 3. People who
understand this algorithm tend to appreciate the drawing more than those who do not. 
They realize how simple it is. This is not an immediate, all-or-nothing, binary process 
though. Since the typical human visual system has a lot of experience with circles, most 
people quickly notice that the curves somehow fit together in a regular way. But few 
are able to immediately state the precise geometric principles underlying the drawing 
[81]. This pattern, however, is learnable from Figure 3. The conscious or subconscious
discovery process leading from a longer to a shorter description of the data, or from 
less to more compression, or from less to more subjectively perceived beauty, yields 
reward depending on the first derivative of subjective beauty, that is, the steepness of 
the learning curve. 

5 Conclusion & Outlook 

We pointed out that a surprisingly simple algorithmic principle based on the notions 
of data compression and data compression progress informally explains fundamental
aspects of attention, novelty, surprise, interestingness, curiosity, creativity, subjective
beauty, jokes, and science & art in general. The crucial ingredients of the corresponding
formal framework are (1) a continually improving predictor or compressor
of the continually growing data history, (2) a computable measure of the compres- 
sor's progress (to calculate intrinsic rewards), (3) a reward optimizer or reinforce- 
ment learner translating rewards into action sequences expected to maximize future 
reward. To improve our previous implementations of these ingredients (Section 3),
we will (1) study better adaptive compressors, in particular, recent, novel RNNs [94]
and other general but practically feasible methods for making predictions [75]; (2) investigate
under which conditions learning progress measures can be computed both
accurately and efficiently, without frequent expensive compressor performance evaluations
on the entire history so far; (3) study the applicability of recent improved RL
techniques in the fields of policy gradients [110, 119, 118, 56, 100, 117], artificial
evolution [43, 20, 21, 19, 22, 23, 24], and others [71, 75].

Apart from building improved artificial curious agents, we can test the predictions 
of our theory in psychological investigations of human behavior, extending previous 
studies in this vein [32] and going beyond anecdotal evidence mentioned above. It
should be easy to devise controlled experiments where test subjects must anticipate 
initially unknown but causally connected event sequences exhibiting more or less com- 
plex, learnable patterns or regularities. The subjects will be asked to quantify their in- 
trinsic rewards in response to their improved predictions. Is the reward indeed strongest 
when the predictions are improving most rapidly? Does the intrinsic reward indeed 
vanish as the predictions become perfect or do not improve any more? 

Finally, how to test our predictions through studies in neuroscience? Currently 
we hardly understand the human neural machinery. But it is well-known that certain 
neurons seem to predict others, and brain scans show how certain brain areas light 
up in response to reward. Therefore the psychological experiments suggested above 
should be accompanied by neurophysiological studies to localize the origins of intrinsic 
rewards, possibly linking them to improvements of neural predictors. 

Success in this endeavor would provide additional motivation to implement our 
principle on robots. 

A Appendix



This appendix is based in part on references [81, 88].

The world can be explained to a degree by compressing it. Discoveries correspond 
to large data compression improvements (found by the given, application-dependent 
compressor improvement algorithm). How to build an adaptive agent that not only 
tries to achieve externally given rewards but also to discover, in an unsupervised and 
experiment-based fashion, explainable and compressible data? (The explanations gained 
through explorative behavior may eventually help to solve teacher-given tasks.) 

Let us formally consider a learning agent whose single life consists of discrete
cycles or time steps $t = 1, 2, \ldots, T$. Its complete lifetime $T$ may or may not be
known in advance. In what follows, the value of any time-varying variable $Q$ at time $t$
($1 \le t \le T$) will be denoted by $Q(t)$, the ordered sequence of values $Q(1), \ldots, Q(t)$
by $Q(\le t)$, and the (possibly empty) sequence $Q(1), \ldots, Q(t-1)$ by $Q(<t)$. At any
given $t$ the agent receives a real-valued input $x(t)$ from the environment and executes
a real-valued action $y(t)$ which may affect future inputs. At times $t < T$ its goal is to
maximize future success or utility

$$u(t) = E_\mu\left[\sum_{\tau=t+1}^{T} r(\tau) \,\Big|\, h(\le t)\right], \qquad (2)$$

where $r(t)$ is an additional real-valued reward input at time $t$, $h(t)$ the ordered triple
$[x(t), y(t), r(t)]$ (hence $h(\le t)$ is the known history up to $t$), and $E_\mu(\cdot \mid \cdot)$ denotes the
conditional expectation operator with respect to some possibly unknown distribution
$\mu$ from a set $\mathcal{M}$ of possible distributions. Here $\mathcal{M}$ reflects whatever is known about
the possibly probabilistic reactions of the environment. For example, $\mathcal{M}$ may contain
all computable distributions [106, 107, 37, 29]. There is just one life, no need for
predefined repeatable trials, no restriction to Markovian interfaces between sensors and
environment, and the utility function implicitly takes into account the expected remaining
lifespan $E_\mu(T \mid h(\le t))$ and thus the possibility to extend it through appropriate
actions [79, 82, 80, 92].

Recent work has led to the first learning machines that are universal and optimal in 
various very general senses [29, 79, 82]. As mentioned in the introduction, such machines
can in principle find out by themselves whether curiosity and world model con-
struction are useful or useless in a given environment, and learn to behave accordingly. 
The present appendix, however, will assume a priori that compression / explanation of 
the history is good and should be done; here we shall not worry about the possibility 
that curiosity can be harmful and "kill the cat." Towards this end, in the spirit of our 
previous work since 1990 |l521|5S|6l]|59llMl[I08llMll22l23III]|88l|8^ we split the 
reward signal $r(t)$ into two scalar real-valued components: $r(t) = g(r_{ext}(t), r_{int}(t))$,
where $g$ maps pairs of real values to real values, e.g., $g(a, b) = a + b$. Here $r_{ext}(t)$
denotes traditional external reward provided by the environment, such as negative reward
in response to bumping against a wall, or positive reward in response to reaching
some teacher-given goal state. But for the purposes of this paper we are especially
interested in $r_{int}(t)$, the internal or intrinsic or curiosity reward, which is provided
whenever the data compressor / internal world model of the agent improves in some
measurable sense. Our initial focus will be on the case $r_{ext}(t) = 0$ for all valid
$t$. The basic principle is essentially the one we published before in various variants:

Principle 1 Generate curiosity reward for the controller in response to improvements
of the predictor or history compressor.

So we conceptually separate the goal (explaining / compressing the history) from the 
means of achieving the goal. Once the goal is formally specified in terms of an algo- 
rithm for computing curiosity rewards, let the controller's reinforcement learning (RL) 
mechanism figure out how to translate such rewards into action sequences that allow 
the given compressor improvement algorithm to find and exploit previously unknown 
types of compressibility. 

A.1 Predictors vs Compressors

Much of our previous work on artificial curiosity was prediction-oriented, e.g., [57,
58, 61, 59, 60, 108, 68, 72, 76]. Prediction and compression are closely related though.
A predictor that correctly predicts many $x(\tau)$, given history $h(<\tau)$, for $1 \le \tau \le t$, can
be used to encode $h(\le t)$ compactly. Given the predictor, only the wrongly predicted
$x(\tau)$ plus information about the corresponding time steps $\tau$ are necessary to reconstruct
history $h(\le t)$, e.g., 16J|. Similarly, a predictor that learns a probability distribution of
the possible next events, given previous events, can be used to efficiently encode observations
with high (respectively low) predicted probability by few (respectively many)
bits [28, 95], thus achieving a compressed history representation. Generally speaking,
we may view the predictor as the essential part of a program $p$ that re-computes $h(\le t)$.
If this program is short in comparison to the raw data $h(\le t)$, then $h(\le t)$ is regular
or non-random [106, 34, 37, 73], presumably reflecting essential environmental laws.
Then $p$ may also be highly useful for predicting future, yet unseen $x(\tau)$ for $\tau > t$.
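
The first encoding scheme mentioned above (store only the mispredicted inputs and their time steps) can be sketched as follows. The predictor used here, which simply repeats the previous symbol and assumes 0 before any input, is a hypothetical stand-in for a learned one.

```python
def predict_next(seen):
    """Toy deterministic predictor: repeat the previous symbol (0 if nothing seen yet)."""
    return seen[-1] if seen else 0


def encode(history):
    """Keep only (time step, symbol) pairs at which the predictor was wrong."""
    corrections, seen = [], []
    for t, x in enumerate(history):
        if predict_next(seen) != x:
            corrections.append((t, x))
        seen.append(x)
    return len(history), corrections


def decode(length, corrections):
    """Rebuild the history by running the same predictor and applying the corrections."""
    fixes, seen = dict(corrections), []
    for t in range(length):
        seen.append(fixes.get(t, predict_next(seen)))
    return seen


history = [0] * 20 + [1] * 30 + [0] * 10
length, corrections = encode(history)
print(corrections)                               # only the two switch points are stored
print(decode(length, corrections) == history)
```

Sixty symbols shrink to a length plus two corrections because the predictor captures the regularity; a better predictor means fewer corrections and hence a shorter description.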

It should be mentioned, however, that the compressor-oriented approach to prediction
based on the principle of Minimum Description Length (MDL) [34, 112, 113, 54, 37]
does not necessarily converge to the correct predictions as quickly as Solomonoff's
universal inductive inference [106, 107, 37], although both approaches converge in the
limit under general conditions [52].

A.2 Which Predictor or History Compressor? 

The complexity of evaluating some compressor $p$ on history $h(\le t)$ depends on both $p$
and its performance measure $C$. Let us first focus on the former. Given $t$, one of the
simplest $p$ will just use a linear mapping to predict $x(t+1)$ from $x(t)$ and $y(t+1)$.
More complex $p$ such as adaptive recurrent neural networks (RNN) [115, 120, 55,
62, 47, 26, 93, 77, 78] will use a nonlinear mapping and possibly the entire history
$h(\le t)$ as a basis for the predictions. In fact, the first work on artificial curiosity [61]
focused on online learning RNN of this type. A theoretically optimal predictor would
be Solomonoff's above-mentioned universal induction scheme [106, 107, 37].

A.3 Compressor Performance Measures

At any time $t$ ($1 \le t < T$), given some compressor program $p$ able to compress
history $h(\le t)$, let $C(p, h(\le t))$ denote $p$'s compression performance on $h(\le t)$. An
appropriate performance measure would be

$$C_l(p, h(\le t)) = l(p), \qquad (3)$$

where $l(p)$ denotes the length of $p$, measured in number of bits: the shorter $p$, the
more algorithmic regularity and compressibility and predictability and lawfulness in
the observations so far. The ultimate limit for $C_l(p, h(\le t))$ would be $K^*(h(\le t))$,
a variant of the Kolmogorov complexity of $h(\le t)$, namely, the length of the shortest
program (for the given hardware) that computes an output starting with $h(\le t)$ [106].
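
Since the shortest program is not computable in general, any practical measure is only an upper bound on it. As a rough illustration, the sketch below uses the length of a zlib encoding as a crude, computable stand-in for $l(p)$; treating a general-purpose compressor's output as the description is our assumption, not a method from the text.

```python
import random
import zlib


def description_length_bits(history: bytes) -> int:
    """Length in bits of a zlib encoding of the history (crude stand-in for l(p))."""
    return 8 * len(zlib.compress(history, 9))


regular = b"0123456789" * 1000                       # highly regular history
rng = random.Random(0)
irregular = bytes(rng.getrandbits(8) for _ in range(10_000))

print(description_length_bits(regular), 8 * len(regular))
print(description_length_bits(irregular), 8 * len(irregular))
```

The regular history shrinks to a small fraction of its raw size, whereas the pseudo-random bytes do not shrink at all under zlib, even though a short program (the seeded generator) does reproduce them. This is exactly the gap between statistical compressors and algorithmic compressibility.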

A.4 Compressor Performance Measures Taking Time Into Account 

$C_l(p, h(\le t))$ does not take into account the time $\tau(p, h(\le t))$ spent by $p$ on computing
$h(\le t)$. An alternative performance measure inspired by concepts of optimal universal
search [36, 75] is

$$C_{l\tau}(p, h(\le t)) = l(p) + \log \tau(p, h(\le t)). \qquad (4)$$

Here compression by one bit is worth as much as runtime reduction by a factor of 1/2.
From an asymptotic optimality-oriented point of view this is one of the best ways of
trading off storage and computation time [36, 75].
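
The trade-off can be checked with a two-line function. The sketch assumes the logarithm is taken to base 2, so that bits and runtime are measured on the same scale; the program lengths and step counts are made-up numbers.

```python
import math


def c_l_tau(program_bits: int, runtime_steps: int) -> float:
    """Time-aware performance measure: program length plus log2 of its runtime."""
    return program_bits + math.log2(runtime_steps)


print(c_l_tau(100, 1024))  # 100-bit program, 1024 steps
print(c_l_tau(99, 2048))   # one bit shorter but twice as slow: same score
print(c_l_tau(100, 512))   # same length but twice as fast: one unit better
```

Under this measure a program that is one bit shorter but takes twice as long scores exactly the same as the original.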

A.5 Measures of Compressor Progress / Learning Progress 

The previous sections only discussed measures of compressor performance, but not of
performance improvement, which is the essential issue in our curiosity-oriented context.
To repeat the point made above: what matters are the improvements of
the compressor, not its compression performance per se. Our curiosity reward in response
to the compressor's progress (due to some application-dependent compressor
improvement algorithm) between times $t$ and $t+1$ should be

$$r_{int}(t+1) = f[C(p(t), h(\le t+1)), C(p(t+1), h(\le t+1))], \qquad (5)$$

where $f$ maps pairs of real values to real values. Various alternative progress measures
are possible; most obvious is $f(a, b) = a - b$. This corresponds to a discrete time
version of maximizing the first derivative of subjective data compressibility.

Note that both the old and the new compressor have to be tested on the same data,
namely, the history so far.
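
A minimal sketch of this progress measure with $f(a, b) = a - b$ follows. The two compressors are hypothetical: adaptive binary models of context order 0 and 1 with Laplace smoothing, whose code length on the history plays the role of $C$. The cost of describing the model itself is ignored here, which is a simplification.

```python
import math
import random
from collections import Counter


def code_length_bits(history, order):
    """Bits used by an adaptive order-`order` binary model with Laplace smoothing."""
    counts, bits = Counter(), 0.0
    for t, x in enumerate(history):
        context = tuple(history[max(0, t - order):t])
        seen_x = counts[(context, x)]
        seen_context = counts[(context, 0)] + counts[(context, 1)]
        bits -= math.log2((seen_x + 1) / (seen_context + 2))
        counts[(context, x)] += 1
    return bits


def curiosity_reward(history, old_order, new_order):
    """Progress f(a, b) = a - b, with old and new compressor tested on the same history."""
    return code_length_bits(history, old_order) - code_length_bits(history, new_order)


regular = [t % 2 for t in range(200)]                # 0, 1, 0, 1, ...
rng = random.Random(0)
noise = [rng.randint(0, 1) for _ in range(200)]

reward_regular = curiosity_reward(regular, old_order=0, new_order=1)
reward_noise = curiosity_reward(noise, old_order=0, new_order=1)
print(reward_regular, reward_noise)
```

Moving from order 0 to order 1 saves many bits on the alternating sequence and essentially nothing on the random one, so only the former counts as a discovery.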

A.6 Asynchronous Framework for Creating Curiosity Reward 

Let $p(t)$ denote the agent's current compressor program at time $t$, $s(t)$ its current controller,
and do:

Controller: At any time $t$ ($1 \le t < T$) do:

1. Let $s(t)$ use (parts of) history $h(\le t)$ to select and execute $y(t+1)$.

2. Observe $x(t+1)$.

3. Check if there is non-zero curiosity reward $r_{int}(t+1)$ provided by the separate,
asynchronously running compressor improvement algorithm (see below). If not,
set $r_{int}(t+1) = 0$.

4. Let the controller's reinforcement learning (RL) algorithm use $h(\le t+1)$ including
$r_{int}(t+1)$ (and possibly also the latest available compressed version of
the observed data; see below) to obtain a new controller $s(t+1)$, in line with
objective (2).

Compressor: Set $p_{new}$ equal to the initial data compressor. Starting at time 1, repeat
forever until interrupted by death at time $T$:

1. Set $p_{old} = p_{new}$; get current time step $t$ and set $h_{old} = h(\le t)$.

2. Evaluate $p_{old}$ on $h_{old}$, to obtain $C(p_{old}, h_{old})$ (Section A.3). This may take many
time steps.

3. Let some (application-dependent) compressor improvement algorithm (such as
a learning algorithm for an adaptive neural network predictor) use $h_{old}$ to obtain
a hopefully better compressor $p_{new}$ (such as a neural net with the same size
but improved prediction capability and therefore improved compression performance
[95]). Although this may take many time steps (and could be partially
performed during "sleep"), $p_{new}$ may not be optimal, due to limitations of the
learning algorithm, e.g., local maxima.

4. Evaluate $p_{new}$ on $h_{old}$, to obtain $C(p_{new}, h_{old})$. This may take many time steps.

5. Get current time step $\tau$ and generate curiosity reward

$$r_{int}(\tau) = f[C(p_{old}, h_{old}), C(p_{new}, h_{old})], \qquad (6)$$

e.g., $f(a, b) = a - b$; see Section A.5.

Obviously this asynchronous scheme may cause long temporal delays between controller
actions and corresponding curiosity rewards. This may impose a heavy burden
on the controller's RL algorithm whose task is to assign credit to past actions (to in- 
form the controller about beginnings of compressor evaluation processes etc., we may 
augment its input by unique representations of such events). Nevertheless, there are 
RL algorithms for this purpose which are theoretically optimal in various senses, to be 
discussed next. 
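
The timing of the scheme can be illustrated by a single-threaded simulation. This is a sketch under strong assumptions: the controller is a stub without any RL, the compressor is an adaptive binary context model identified with its order, the improvement algorithm just picks the best order from a small candidate set, and each compressor cycle is assumed to take a fixed number of time steps.

```python
import math
from collections import Counter


def code_length_bits(history, order):
    """Bits used by an adaptive order-`order` model with Laplace smoothing (toy C(p, h))."""
    counts, bits = Counter(), 0.0
    for t, x in enumerate(history):
        context = tuple(history[max(0, t - order):t])
        seen_x = counts[(context, x)]
        seen_context = counts[(context, 0)] + counts[(context, 1)]
        bits -= math.log2((seen_x + 1) / (seen_context + 2))
        counts[(context, x)] += 1
    return bits


def run_life(lifetime=300, cycle_steps=25):
    history = []                          # the history so far, here a list of bits
    rewards = [0.0] * (lifetime + 1)      # curiosity reward per step; 0 unless reported
    p_new = 0                             # a compressor is identified with its order
    pending, ready_at = None, 0
    for t in range(1, lifetime + 1):
        # Controller (stub): no RL here, the agent simply keeps watching one source.
        history.append(t % 2)
        # Compressor, conceptually running in the background:
        if pending is None:
            p_old, h_old = p_new, list(history)
            # Assumed improvement algorithm: best order from a small candidate set.
            p_new = min(range(4), key=lambda k: code_length_bits(h_old, k))
            pending = code_length_bits(h_old, p_old) - code_length_bits(h_old, p_new)
            ready_at = t + cycle_steps    # evaluation and learning take many steps
        elif t >= ready_at:
            rewards[t] = pending          # delayed reward, f(a, b) = a - b
            pending = None
    return rewards


rewards = run_life()
first_paid = max(range(len(rewards)), key=lambda t: rewards[t])
print(first_paid, rewards[first_paid])
```

The one substantial reward arrives many steps after the data that made the improvement possible, and later cycles pay nothing because the regularity is already known. Assigning such delayed rewards to the responsible actions is the burden placed on the controller's RL algorithm.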

A.7 Optimal Curiosity & Creativity & Focus of Attention



Our chosen compressor class typically will have certain computational limitations. In
the absence of any external rewards, we may define optimal pure curiosity behavior
relative to these limitations. At time $t$ this behavior would select the action that maximizes

$$u(t) = E_\mu\left[\sum_{\tau=t+1}^{T} r_{int}(\tau) \,\Big|\, h(\le t)\right]. \qquad (7)$$

Since the true, world-governing probability distribution is unknown, the resulting 
task of the controller's RL algorithm may be a formidable one. As the system is re- 
visiting previously incompressible parts of the environment, some of those will tend 
to become more subjectively compressible, and the corresponding curiosity rewards 
will decrease over time. A good RL algorithm must somehow detect and then predict 
this decrease, and act accordingly. Traditional RL algorithms [33], however, do not
provide any theoretical guarantee of optimality for such situations. (This is not to say 
though that sub-optimal RL methods may not lead to success in certain applications; 
experimental studies might lead to interesting insights.) 

Let us first make the natural assumption that the compressor is not super-complex 
such as Kolmogorov's, that is, its output and $r_{int}(t)$ are computable for all $t$. Is there
a best possible RL algorithm that comes as close as any other to maximizing objective
(7)? Indeed, there is. Its drawback, however, is that it is not computable in finite time.
Nevertheless, it serves as a reference point for defining what is achievable at best. 

A.8 Optimal But Incomputable Action Selector 

There is an optimal way of selecting actions which makes use of Solomonoff's theoretically
optimal universal predictors and their Bayesian learning algorithms [106, 107,
37, 29, 30]. The latter only assume that the reactions of the environment are sampled
from an unknown probability distribution $\mu$ contained in a set $\mathcal{M}$ of all enumerable
distributions; compare text after equation (2). More precisely, given an observation
sequence $q(\le t)$ we want to use the Bayes formula to predict the probability of the next
possible $q(t+1)$. Our only assumption is that there exists a computer program that
can take any $q(\le t)$ as an input and compute its a priori probability according to the $\mu$
prior. In general we do not know this program, hence we predict using a mixture prior
instead:

$$\xi(q(\le t)) = \sum_i w_i \mu_i(q(\le t)), \qquad (8)$$

a weighted sum of all distributions $\mu_i \in \mathcal{M}$, $i = 1, 2, \ldots$, where the sum of the constant
positive weights satisfies $\sum_i w_i \le 1$. This is indeed the best one can possibly do,
in a very general sense [107, 29]. The drawback of the scheme is its incomputability,
since $\mathcal{M}$ contains infinitely many distributions. We may increase the theoretical power
of the scheme by augmenting $\mathcal{M}$ by certain non-enumerable but limit-computable distributions
[73], or restrict it such that it becomes computable, e.g., by assuming the
world is computed by some unknown but deterministic computer program sampled
from the Speed Prior [74] which assigns low probability to environments that are hard
to compute by any method.
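
Prediction with a mixture prior becomes computable once the model set is made finite. The sketch below does this for three hypothetical Bernoulli environments with equal weights; the restriction to a finite set and the particular models are assumptions, and the universal mixture over all enumerable distributions is of course not what is computed here.

```python
def mu(p_one, history):
    """Probability of a bit sequence under a Bernoulli environment with P(1) = p_one."""
    prob = 1.0
    for bit in history:
        prob *= p_one if bit == 1 else 1.0 - p_one
    return prob


def xi(history, models, weights):
    """Mixture prior: weighted sum of the model probabilities of the history."""
    return sum(w * mu(p, history) for p, w in zip(models, weights))


def predict_one(history, models, weights):
    """Bayes prediction that the next bit is 1: xi(history + [1]) / xi(history)."""
    return xi(history + [1], models, weights) / xi(history, models, weights)


models = [0.1, 0.5, 0.9]            # hypothetical finite model set
weights = [1 / 3, 1 / 3, 1 / 3]     # constant positive weights summing to 1

print(predict_one([], models, weights))        # no data yet: the weighted average
print(predict_one([1] * 10, models, weights))  # after ten 1s: close to the 0.9 model
```

With no data the prediction is the weighted average of the models; after ten observed 1s the mixture is dominated by the model that explains them best, without that model ever being selected explicitly.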

Once we have such an optimal predictor, we can extend it by formally including
the effects of executed actions to define an optimal action selector maximizing future
expected reward. At any time $t$, Hutter's theoretically optimal (yet uncomputable) RL
algorithm AIXI [29] uses an extended version of Solomonoff's prediction scheme to
select those action sequences that promise maximal future reward up to some horizon
$T$, given the current data $h(\le t)$. That is, in cycle $t+1$, AIXI selects as its next action
the first action of an action sequence maximizing $\xi$-predicted reward up to the given
horizon, appropriately generalizing eq. (8). AIXI uses observations optimally [29]:
the Bayes-optimal policy $p^\xi$ based on the mixture $\xi$ is self-optimizing in the sense that
its average utility value converges asymptotically for all $\mu \in \mathcal{M}$ to the optimal value
achieved by the Bayes-optimal policy $p^\mu$ which knows $\mu$ in advance. The necessary
and sufficient condition is that $\mathcal{M}$ admits self-optimizing policies. The policy $p^\xi$ is
also Pareto-optimal in the sense that there is no other policy yielding higher or equal
value in all environments $\nu \in \mathcal{M}$ and a strictly higher value in at least one [29].

A.9 A Computable Selector of Provably Optimal Actions 

AIXI above needs unlimited computation time. Its computable variant AIXI(t,l) [29]
has asymptotically optimal runtime but may suffer from a huge constant slowdown. To
take the consumed computation time into account in a general, optimal way, we may
use the recent Gödel machines [79, 82, 80, 92] instead. They represent the first class of
mathematically rigorous, fully self-referential, self-improving, general, optimally efficient
problem solvers. They are also applicable to the problem embodied by objective
(7).

The initial software S of such a Gödel machine contains an initial problem solver,
e.g., some typically sub-optimal method [33]. It also contains an asymptotically optimal
initial proof searcher based on an online variant of Levin's Universal Search [36],
which is used to run and test proof techniques. Proof techniques are programs written
in a universal language implemented on the Gödel machine within S. They are in principle
able to compute proofs concerning the system's own future performance, based
on an axiomatic system A encoded in S. A describes the formal utility function, in our
case eq. (7), the hardware properties, axioms of arithmetic and probability theory and
data manipulation etc., and S itself, which is possible without introducing circularity.

Inspired by Kurt Gödel's celebrated self-referential formulas (1931), the Gödel machine
rewrites any part of its own code (including the proof searcher) through a self-generated
executable program as soon as its Universal Search variant has found a proof
that the rewrite is useful according to objective (7). According to the Global Optimality
Theorem [79, 82, 80, 92], such a self-rewrite is globally optimal (no local maxima
possible!) since the self-referential code first had to prove that it is not useful to continue
the search for alternative self-rewrites.

If there is no provably useful optimal way of rewriting S at all, then humans will not
find one either. But if there is one, then S itself can find and exploit it. Unlike the previous
non-self-referential methods based on hardwired proof searchers [29], Gödel machines
not only boast an optimal order of complexity but can optimally reduce (through
self-changes) any slowdowns hidden by the O()-notation, provided the utility of such
speed-ups is provable. Compare [83, 86, 85].

A.10 Non-Universal But Still General and Practical RL Algorithms

Recently there has been substantial progress in RL algorithms that are not quite as universal
as those above, but nevertheless capable of learning very general, program-like
behavior. In particular, evolutionary methods [53, 99, 27] can be used for training Recurrent
Neural Networks (RNN), which are general computers. Many approaches to
evolving RNN have been proposed [40, 122, ...]. One particularly
effective family of methods uses cooperative coevolution to search the space of network
components (neurons or individual synapses) instead of complete networks. The
components are coevolved by combining them into networks, and selecting those for
reproduction that participated in the best-performing networks [43, 20, ..., 19, 22, 24].
Other recent RL techniques for RNN are based on the concept of policy gradients
[110, 119, 118, 56, 100, 117]. It will be of interest to evaluate variants of such control
learning algorithms within the curiosity reward framework.
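
The cooperative coevolution scheme described above can be illustrated with a small Python
sketch. It is a toy under stated assumptions, not a reimplementation of any of the cited
methods: the task (reproduce the previous input), the neuron encoding `(w_in, w_rec, w_out)`
with self-recurrence only, the credit rule (a neuron's score is the average fitness of the
networks it joined), the truncation selection with Gaussian mutation, and all function
names are hypothetical choices made for illustration.

```python
import math
import random

def run_network(neurons, inputs):
    """Run a tiny RNN; each neuron is (w_in, w_rec, w_out) with its own recurrent state."""
    states = [0.0] * len(neurons)
    outputs = []
    for x in inputs:
        states = [math.tanh(w_in * x + w_rec * h)
                  for (w_in, w_rec, _), h in zip(neurons, states)]
        outputs.append(sum(w_out * h for (_, _, w_out), h in zip(neurons, states)))
    return outputs

def fitness(neurons, inputs, targets):
    """Negative mean squared error, so higher is better."""
    outputs = run_network(neurons, inputs)
    return -sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(targets)

def coevolve(n_subpops=3, pop_size=10, generations=30, trials=50, seed=0):
    rng = random.Random(seed)
    # Assumed toy task: output the previous input, which requires some memory.
    inputs = [rng.uniform(-1, 1) for _ in range(20)]
    targets = [0.0] + inputs[:-1]
    # One subpopulation of candidate neurons per network position.
    subpops = [[tuple(rng.gauss(0, 1) for _ in range(3)) for _ in range(pop_size)]
               for _ in range(n_subpops)]
    best_fit, best_net, history = -math.inf, None, []
    for _ in range(generations):
        total = [[0.0] * pop_size for _ in range(n_subpops)]
        count = [[0] * pop_size for _ in range(n_subpops)]
        for _ in range(trials):
            # Combine one randomly chosen neuron per subpopulation into a network.
            picks = [rng.randrange(pop_size) for _ in range(n_subpops)]
            net = [subpops[s][i] for s, i in enumerate(picks)]
            f = fitness(net, inputs, targets)
            if f > best_fit:
                best_fit, best_net = f, net
            # Credit the network's fitness to every neuron that participated.
            for s, i in enumerate(picks):
                total[s][i] += f
                count[s][i] += 1
        for s in range(n_subpops):
            avg = lambda i: total[s][i] / count[s][i] if count[s][i] else -math.inf
            ranked = sorted(range(pop_size), key=avg, reverse=True)
            # Neurons from the better networks reproduce; the rest are replaced.
            parents = [subpops[s][i] for i in ranked[:max(1, pop_size // 2)]]
            children = [tuple(w + rng.gauss(0, 0.3) for w in parents[j % len(parents)])
                        for j in range(pop_size - len(parents))]
            subpops[s] = parents + children
        history.append(best_fit)
    return best_net, best_fit, history

if __name__ == "__main__":
    net, fit, history = coevolve()
    print("best fitness:", fit)
```

Because `history` records the best fitness found so far, it is non-decreasing by
construction; it says nothing about how quickly a real method of this family would converge.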

Figure 1: Previously published construction plan [69, 88] of a female face (1998).
Some human observers report they feel this face is 'beautiful.' Although the drawing
has lots of noisy details (texture etc.) without an obvious short description, positions
and shapes of the basic facial features are compactly encodable through a very simple
geometrical scheme, simpler and much more precise than ancient facial proportion
studies by Leonardo da Vinci and Albrecht Dürer. Hence the image contains a highly
compressible algorithmic regularity or pattern describable by few bits of information.
An observer can perceive it through a sequence of attentive eye movements or saccades,
and consciously or subconsciously discover the compressibility of the incoming
data stream. How was the picture made? First the sides of a square were partitioned
into 2^ equal intervals. Certain interval boundaries were connected to obtain three
rotated, superimposed grids based on lines with slopes ±1 or ±1/2^ or ±2^/1.
Higher-resolution details of the grids were obtained by iteratively selecting two previously
generated, neighboring, parallel lines and inserting a new one equidistant to both.
Finally the grids were vertically compressed by a factor of 1 - 2^. The resulting lines
and their intersections define essential boundaries and shapes of eyebrows, eyes, lid
shades, mouth, nose, and facial frame in a simple way that is obvious from the
construction plan. Although this plan is simple in hindsight, it was hard to find: hundreds
of my previous attempts at discovering such precise matches between simple geometries
and pretty faces failed.
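
The refinement step of this construction plan can be sketched in a few lines of Python.
This is a minimal illustration under stated assumptions, not the code used to produce the
drawing: the exponent `k` is a free parameter here, a family of parallel lines is
represented only by its sorted intercepts, and the names `initial_boundaries` and
`insert_midline` are hypothetical.

```python
from fractions import Fraction

def initial_boundaries(k):
    """Boundaries of 2**k equal intervals on a unit-length side (k is an assumed parameter)."""
    n = 2 ** k
    return [Fraction(i, n) for i in range(n + 1)]

def insert_midline(intercepts, i):
    """Insert a new line equidistant to the neighboring parallel lines i and i+1.

    The family of parallel lines is represented by its sorted intercepts only.
    """
    mid = (intercepts[i] + intercepts[i + 1]) / 2
    return intercepts[:i + 1] + [mid] + intercepts[i + 1:]

# Start from 2**2 = 4 intervals, then refine twice between the first two lines.
lines = initial_boundaries(2)
lines = insert_midline(lines, 0)  # adds the line at 1/8
lines = insert_midline(lines, 0)  # adds the line at 1/16
print([str(x) for x in lines])
```

Exact fractions make it visible that every inserted line keeps a dyadic intercept, which
is one way to see why such a grid admits a short description.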

Figure 2: Image of a butterfly and a vase with a flower, reprinted from Leonardo
[67, 81]. An explanation of how the image was constructed and why it has a very short
description is given in Figure 3.

Figure 3: Explanation of how Figure 2 was constructed through a very simple algorithm
exploiting fractal circles [67]. The frame is a circle; its leftmost point is the center
of another circle of the same size. Every point where two circles of equal size touch or
intersect is the center of two more circles, one of equal size and one of half size. Each
line of the drawing is a segment of some circle; its endpoints are where circles touch or
intersect. There are few big circles and many small ones. In general, the smaller a circle,
the more bits are needed to specify it. The drawing is simple (compressible) as it is based
on few, rather large circles. Many human observers report that they derive a certain
amount of pleasure from discovering this simplicity. The observer's learning process
causes a reduction of the subjective complexity of the data, yielding a temporarily high
derivative of subjective beauty: a temporarily steep learning curve. (Again I needed a
long time to discover a satisfactory and rewarding way of using fractal circles to create
a reasonable drawing.)
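
The circle-generation rule in this caption can be sketched as follows. This is one
possible reading of the rule, given as an assumption: in each round, every touch or
intersection point of two equal-size circles becomes the center of one circle of that size
and one of half that size. Which arcs end up in the drawing is the artist's choice and is
not modeled. The function names, the number of rounds, and the rounding used to merge
duplicate centers are hypothetical.

```python
import math

def intersections(c1, c2, eps=1e-9):
    """Points where two circles (x, y, r) touch or intersect; empty list if none."""
    (x1, y1, r1), (x2, y2, r2) = c1, c2
    d = math.hypot(x2 - x1, y2 - y1)
    if d < eps or d > r1 + r2 + eps or d < abs(r1 - r2) - eps:
        return []  # concentric, too far apart, or one strictly inside the other
    a = (d * d + r1 * r1 - r2 * r2) / (2 * d)   # distance from c1 to the chord
    h = math.sqrt(max(r1 * r1 - a * a, 0.0))    # half the chord length
    mx, my = x1 + a * (x2 - x1) / d, y1 + a * (y2 - y1) / d
    ox, oy = -h * (y2 - y1) / d, h * (x2 - x1) / d
    points = [(mx + ox, my + oy)]
    if h > eps:  # two distinct points unless the circles merely touch
        points.append((mx - ox, my - oy))
    return points

def fractal_circles(rounds=2, radius=1.0, digits=9):
    """Grow the set of circles for a given number of rounds (assumed stopping rule)."""
    frame = (0.0, 0.0, radius)
    circles = {frame, (-radius, 0.0, radius)}  # second circle sits on the leftmost point
    for _ in range(rounds):
        current = sorted(circles)
        new = set()
        for i, c1 in enumerate(current):
            for c2 in current[i + 1:]:
                if abs(c1[2] - c2[2]) > 1e-9:
                    continue  # the rule applies to pairs of equal size only
                for px, py in intersections(c1, c2):
                    px, py = round(px, digits), round(py, digits)
                    new.add((px, py, c1[2]))      # circle of equal size
                    new.add((px, py, c1[2] / 2))  # circle of half size
        circles |= new
    return circles

if __name__ == "__main__":
    # 6: the two starting circles plus four new ones
    print(len(fractal_circles(rounds=1)))
```

The set grows quickly with the number of rounds, so small values of `rounds` are advisable
when experimenting.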